The Open Data Science course requires several skills from statisticians and other data scientists. In this course we learn to code, recode, program, visualize, analyze, and interpret data.
1. Read the students2014 data into R, either from your local folder or from this URL. Explore the structure and dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
#Read the data into R from the URL
lrn14 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt")
#the structure of the data
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#the dimensions of the data
dim(lrn14)
## [1] 166 7
Description of the data
This data consists of 166 observations and 7 variables (gender, age, attitude, deep, stra, surf, points).
Gender is a factor variable; age and points are integer variables; and attitude, deep, stra and surf are numeric variables.
Note also that
gender: M (Male), F (Female)
age (in years) was derived from the date of birth
attitude = global attitude toward statistics
deep = deep learning, e.g., “I usually set out to understand for myself the meaning of what we have to learn (D03)”
stra = strategy learning, e.g., “I organize my study time carefully to make the best use of it (ST04)”
surf = surface learning, e.g., “I find I have to concentrate on just memorising a good deal of what I have to learn (SU05)”
2. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
2.1. A scatter plot matrix of the variables
# draw a scatter plot matrix of the variables.
pairs(lrn14)
2.2. More advanced plot
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
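Task 2 also asks for numerical summaries of the variables, which the plots above do not show. A minimal sketch (reading the data from the same URL as above; requires an internet connection):

```r
# read the data (same URL as above) and summarise it
lrn14 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt")
summary(lrn14)             # min/quartiles/median/mean/max per variable
round(cor(lrn14[, -1]), 2) # correlations of the numeric variables (gender dropped)
```

The correlation matrix gives the same coefficients that ggpairs() prints in the upper triangle of the plot matrix.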
2.3. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them:
Distributions
There were more female students than male students.
As both figures show, the distribution of the age variable is positioned around the age of 20, and only a few participants are over 30 years old. The distribution is highly skewed to the right (a long right tail).
The distribution of the attitude variable is approximately normal and positioned around 3 (Likert scale 0-5).
The distribution of the deep variable is approximately normal (slightly skewed to the left) and positioned around 3.5 (Likert scale 0-5).
The distribution of the stra variable is approximately normal and positioned around 3 (Likert scale 0-5).
The distribution of the surf variable is approximately normal (slightly skewed to the right) and positioned around 2.5 (Likert scale 0-5).
The distribution of the points variable is approximately normal (slightly skewed to the left) and positioned around 25.
Relationships:
Correlations between age and attitude (r = .02), age and deep (r = .03), and age and points (r = -.09) were negligible; correlations between age and stra (r = .10) and age and surf (r = -.14) were small but somewhat larger, suggesting that as age increases, strategic learning (stra) increases slightly and surface learning (surf) decreases slightly.
The correlation between attitude and exam points was moderately high (r = .44), indicating that a more positive attitude toward statistics is related to higher exam points. The correlation between attitude and surface learning was negative but quite small (r = -.18), indicating that a more positive attitude toward statistics is related to a lower level of surface learning. The other correlations were low (attitude and stra, r = .06; attitude and deep, r = .11).
The correlation between deep and surf was moderate and negative (r = -.32), indicating that a higher level of deep learning is related to a lower level of surface learning. In fact, I would have expected a stronger correlation; perhaps students use both ways of learning. Correlations between deep and points (r = -.01) and deep and stra (r = .096) were low.
Correlations between stra and surf (r = -.16) and stra and points (r = .15) were quite low, suggesting that higher strategic learning is weakly related to lower surface learning and to higher exam points.
Furthermore, there was a small negative correlation between surf and points (r = -.14), suggesting that higher surface learning is weakly related to lower exam points.
3. Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = lrn14)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Interpret the results. Explain and interpret the statistical test related to the model parameters.
Residuals: the model is fitted so that it minimizes the sum of the squared residuals, which are the prediction errors (the differences between the predicted and the observed values of the response).
Estimate = the estimated beta (regression coefficient): the main point is to estimate the beta parameters. Only attitude explained exam points significantly (beta = 3.40, p < .001), indicating that when attitude changes by one unit, the average change in exam points is 3.40 points.
In other words, the t-test indicates a statistically significant relationship between exam points and attitude. More specifically, the null hypothesis H0: beta = 0 is rejected, implying that a linear relationship between x and y exists.
On the other hand, stra (strategic learning; beta = 0.85, p = .12) and surf (surface learning; beta = -0.59, p = .47) did not explain exam points significantly, as their t-tests did not reach significance.
Note that Std. Error is the standard error of the coefficient estimate (largest for surface learning), not a standardized error; it describes how precisely the coefficient is estimated. (The residual standard error, 5.296, is the figure that represents the average distance of the observed values from the regression line.) As the standard errors were quite small, the coefficients are estimated fairly precisely.
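The t values and p-values in the coefficient table follow directly from the estimates and standard errors; a quick check of the attitude row (values copied from the summary above, so results match only up to rounding):

```r
# t value = Estimate / Std. Error; p-value = two-sided tail probability on 162 df
t_attitude <- 3.3952 / 0.5741
t_attitude                          # about 5.914 (5.913 in the summary, up to rounding)
2 * pt(-abs(t_attitude), df = 162)  # about 2e-08, matching Pr(>|t|) for attitude
```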
If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
Now, we remove variables stra and surf:
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude, data = lrn14)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
4. Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R squared of the model.
As mentioned earlier, the relationship between exam points and attitude was significant (beta = 3.53, p < .001), indicating that when attitude changes by one unit, the average change in exam points is 3.53 points. Note that we now consider the model from which the non-significant variables were removed. The multiple R-squared was R2 = .19, indicating that attitude explains 19% of the variance in exam points.
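With a single predictor, the multiple R-squared is simply the squared correlation between the predictor and the response, so the value can be checked against the correlation from the plot matrix (up to rounding):

```r
r <- 0.44  # correlation between attitude and points from the plot matrix
r^2        # about 0.19, matching the Multiple R-squared (0.1906) of my_model2
```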
5. Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
Note that only the significant explanatory variable, attitude, was included in the model.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1:2, 5))
Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
The assumption that the size of a given error does not depend on the explanatory variables can be explored with a simple scatter plot of residuals versus fitted values (figure 1). As we can see, the spread is reasonably random and shows no pattern, indicating that this assumption holds.
The QQ-plot of the residuals (figure 2) provides a way to explore the assumption that the errors of the model are normally distributed. As the points fall quite close to the line, the normality assumption is reasonable.
We also need to investigate problematic observations such as outliers. Leverage measures how much impact a single observation has on the model. The residuals vs. leverage plot (figure 3) can help identify observations with an unusually high impact. As no single observation stands out (e.g., as an outlier), none of the observations has a great impact on the model. However, observations such as 71, 56 and 35 could be examined more closely to see whether their removal would change the model.
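If those flagged observations were to be examined further, the leverage and Cook's distance values could be extracted directly from the fitted model. A sketch of the idea on R's built-in mtcars data (the same calls apply to my_model2; the 4/n cutoff is one common rule of thumb, not the only one):

```r
# example model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)
lev <- hatvalues(fit)        # leverage of each observation
cd  <- cooks.distance(fit)   # overall influence of each observation
# flag observations above a common rule-of-thumb cutoff
names(cd)[cd > 4 / nrow(mtcars)]
```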
Exercise 3, Marja Holm, 10.2.2017
2. Read the joined student alcohol consumption data into R, either from your local folder or from the URL. Print out the names of the variables in the data and describe the data set briefly, assuming the reader has no previous knowledge of it.
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = "," , header=TRUE)
#the structure of the data
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
#the dimensions of the data
dim(alc)
## [1] 382 35
Describe the data set briefly, assuming the reader has no previous knowledge of it.
This data consists of 382 observations and 35 variables.
Factor variables: school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, nursery, internet, guardian, schoolsup, famsup, paid, activities, higher, romantic.
Integer variables: age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3.
Numeric: alc_use.
Logical: high_use.
3. The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.
We selected variables that might have relationships with alcohol consumption.
H1. Going out with friends (goout) might have a positive association with alcohol consumption (alc_use).
H2. Weekly study time (studytime) might have a negative association with alcohol consumption (alc_use).
H3. Students’ age (age: 15 to 20) might have a positive association with alcohol consumption (alc_use), i.e. older students might have higher alcohol consumption than younger ones.
H4. Number of past class failures (failures) might have a positive association with alcohol consumption (alc_use).
4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
First, a dataset (named dta) including only the study variables (goout, studytime, age, failures, alc_use, high_use) was created.
#Access the dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Choose which columns to keep in the analysis dataset
keep_columns<-c("goout","studytime","age", "failures", "alc_use", "high_use")
#Create the analysis dataset that includes only these variables
dta <- select(alc, one_of(keep_columns))
str(dta)
## 'data.frame': 382 obs. of 6 variables:
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ studytime: int 2 2 2 3 2 2 2 2 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
#Save the data frame in csv format. Before this, the working directory was set to the Data folder under the IODS-project.
write.table(dta, file="dta.csv")
dta<-read.table("dta.csv")
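As a sanity check, saving with write.table and reloading with read.table should reproduce the column values. A quick round trip with hypothetical toy data and a temporary file (same defaults as used for dta above):

```r
# toy data frame and a temporary file for the round trip
df <- data.frame(x = 1:3, flag = c(TRUE, FALSE, TRUE))
f  <- tempfile(fileext = ".csv")
write.table(df, file = f)  # write.table defaults, as for dta above
df2 <- read.table(f)
identical(df$x, df2$x) && identical(df$flag, df2$flag)  # TRUE
```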
4.1. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# a more advanced plot matrix with ggpairs()
p <- ggpairs(dta, mapping = aes(),lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
Figure 1. Correlations and distributions
ta0<-xtabs(~alc_use+goout, data=dta)
ta0
## goout
## alc_use 1 2 3 4 5
## 1 16 49 45 25 9
## 1.5 3 22 29 8 6
## 2 2 13 26 10 7
## 2.5 2 7 13 16 4
## 3 0 4 3 14 11
## 3.5 1 2 4 3 7
## 4 0 0 1 5 3
## 4.5 0 1 0 1 1
## 5 0 1 2 0 6
Table1. Cross-tabulations: Relationships between alcohol use (alc_use) and going out with friends (goout)
ta1<-xtabs(~alc_use+studytime, data=dta)
ta1
## studytime
## alc_use 1 2 3 4
## 1 22 74 31 17
## 1.5 21 33 10 4
## 2 17 26 13 2
## 2.5 11 25 5 1
## 3 12 18 1 1
## 3.5 11 4 2 0
## 4 4 4 0 1
## 4.5 0 3 0 0
## 5 5 3 0 1
Table 2. Cross-tabulations: relationships between alcohol use (alc_use) and weekly study time (studytime)
ta2<-xtabs(~alc_use+age, data=dta)
ta2
## age
## alc_use 15 16 17 18 19 20 22
## 1 46 41 26 27 3 1 0
## 1.5 9 24 20 12 3 0 0
## 2 8 14 19 16 1 0 0
## 2.5 8 9 16 8 1 0 0
## 3 4 9 8 8 3 0 0
## 3.5 3 3 5 6 0 0 0
## 4 2 4 2 1 0 0 0
## 4.5 0 0 2 1 0 0 0
## 5 1 3 2 2 0 0 1
Table 3. Cross-tabulations: relationships between alcohol use (alc_use) and age
ta3<-xtabs(~alc_use+failures, data=dta)
ta3
## failures
## alc_use 0 1 2 3
## 1 127 10 1 6
## 1.5 60 6 0 2
## 2 47 6 4 1
## 2.5 36 3 2 1
## 3 20 4 4 4
## 3.5 10 5 0 2
## 4 6 3 0 0
## 4.5 3 0 0 0
## 5 7 1 0 1
Table 4. Cross-tabulations: relationships between alcohol use and number of past class failures (failures)
library(ggplot2)
alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = alc_use, y = goout))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("goout")+aes(col=alc_use)+ggtitle("Going out with friends by alcohol use")
Figure 2. Box plots I
library(ggplot2)
alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = alc_use, y = studytime))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("studytime")+aes(col=alc_use)+ggtitle("Students' weekly study time by alcohol use")
Figure 3. Box plots II
library(ggplot2)
alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = alc_use, y = age))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("age")+aes(col=alc_use)+ggtitle("Student age by alcohol use")
Figure 4. Box plots III
library(ggplot2)
alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = alc_use, y = failures))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("failures")+aes(col=alc_use)+ggtitle("Number of past class failures by alcohol use")
Figure 5. Box plots IV
4.2. Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
Distributions
Figure 1 shows the distributions of the variables:
The distribution of goout shows that students most often go out with friends at the middle level (3; 1 = very low and 5 = very high).
The distribution of studytime shows that the most common weekly study time was category 2 (on the 1-4 scale).
The distribution of age indicates that students were most often between 15 and 18 years old, and only a few were between 18 and 20.
The distribution of failures shows that the number of past class failures was most often 0.
Dependent variables
The distribution of alc_use shows that for most students the combined weekday and weekend alcohol use was very low (1).
The distribution of high_use shows that most students were not high users of alcohol (high_use = FALSE).
Relationship
Figure 1 also shows that our hypotheses were largely supported. There was a positive correlation between goout and alc_use (r = .388) (H1), a negative correlation between studytime and alc_use (r = -.24) (H2), a positive correlation between age and alc_use (r = .156) (H3), and a positive correlation between failures and alc_use (r = .15) (H4).
However, the cross-tabulations and box plots describe the relationships in more detail, showing that there is also a lot of variation.
Table 1 and figure 2 show that most students who go out an average amount (2 or 3) have quite low alcohol use, while most of those who go out at a very high level also have high alcohol use.
Table 2 and figure 3 show that most students in study-time category 2 have quite low alcohol use. Note also that students with high alcohol use mostly belong to study-time categories 1 or 2, while students in the higher categories mostly have low alcohol use, with a few exceptions.
Table 3 and figure 4 show that students aged around 16 often have low alcohol use, while the average age is somewhat higher among students with higher alcohol use, with a few exceptions.
Table 4 and figure 5 show that students without failures most often have low alcohol use, although some also have high use; this is consistent with the quite low correlation (r = .15).
5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.
# find the model with glm()
m <- glm(high_use ~ goout+ studytime+age+failures, data = dta, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + studytime + age + failures,
## family = "binomial", data = dta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8036 -0.7813 -0.5332 0.8684 2.5526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.76790 1.84119 -2.046 0.04071 *
## goout 0.70987 0.11705 6.064 1.32e-09 ***
## studytime -0.59020 0.16849 -3.503 0.00046 ***
## age 0.09935 0.11028 0.901 0.36764
## failures 0.17576 0.17195 1.022 0.30672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 394.64 on 377 degrees of freedom
## AIC: 404.64
##
## Number of Fisher Scoring iterations: 4
Interpret a summary of the fitted model.
Above is the summary of the fitted model. As the z-values are significant (large enough) only for goout and studytime, only these variables matter (i.e., their true regression coefficients are not 0). In other words, only the relationships of goout and studytime with high_use are significant.
Note also that the standard errors of the four explanatory variables were quite low (below 1), indicating that the estimates are not too spread out.
# print out the coefficients of the model
coef(m)
## (Intercept) goout studytime age failures
## -3.76789872 0.70986928 -0.59020183 0.09935216 0.17575616
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI<-exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02310055 0.0005942399 0.8239749
## goout 2.03372540 1.6262645163 2.5759180
## studytime 0.55421541 0.3939867475 0.7640697
## age 1.10445518 0.8902994937 1.3733059
## failures 1.19214732 0.8494144571 1.6720407
Interpret the coefficients of the model as odds ratios
We only interpret the significant variables:
The odds ratio for going out with friends (OR = 2.03) was greater than one, indicating a positive relationship: the more a student goes out with friends, the more likely his/her alcohol use is high (high_use = TRUE). In other words, each one-unit increase in goout roughly doubles the odds of high alcohol use.
The odds ratio for studytime (OR = 0.55) was less than one, indicating a negative relationship: the less a student studies, the more likely his/her alcohol use is high. In other words, each one-unit increase in studytime multiplies the odds of high alcohol use by about 0.55.
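The odds ratios are simply the exponentiated logit coefficients, so they can be checked by hand against the coefficient table above; note also that they act multiplicatively:

```r
# odds ratio = exp(logit coefficient)
exp(0.70987)    # goout: about 2.03
exp(-0.59020)   # studytime: about 0.55
# ORs are multiplicative: a two-unit increase in goout scales the odds by OR^2
exp(0.70987)^2  # about 4.14
```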
Compare them to your previously stated hypotheses.
Only hypotheses H1 (going out with friends has a positive association with alcohol consumption) and H2 (weekly study time has a negative association with alcohol consumption) were supported when all four explanatory variables were in the same logistic regression model (controlling for each other’s effects).
6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with the performance achieved by some simple guessing strategy.
Note that only the variables with a significant relationship with high/low alcohol consumption were kept in the next model.
# fit the model
m <- glm(high_use ~ goout + studytime, data = dta, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
dta <- mutate(dta, probability = probabilities)
# use the probabilities to make a prediction of high_use
dta <- mutate(dta, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = dta$high_use, prediction = dta$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 22
## TRUE 70 42
table(high_use = dta$high_use, prediction = dta$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64921466 0.05759162 0.70680628
## TRUE 0.18324607 0.10994764 0.29319372
## Sum 0.83246073 0.16753927 1.00000000
Above is the cross-tabulation of the target variable versus the predictions, as proportions of all observations. Our model with the two variables (goout and studytime) correctly classified 65% of the observations as FALSE (low alcohol use) and 11% as TRUE (high alcohol use), i.e. about 76% in total; the remaining 24% were misclassified (the training error).
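The training error reported in the next task can be read straight off the count table: it is the proportion of off-diagonal (misclassified) observations. A quick check with the counts above:

```r
# confusion matrix from the cross-tabulation above
tab <- matrix(c(248, 70, 22, 42), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tab)) / sum(tab)  # (248 + 42) / 382
1 - accuracy                           # training error, about 0.241
```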
7. Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) than the model introduced in DataCamp (which had about 0.26 error)? Could you find such a model?
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(dta$high_use, dta$probability)
## [1] 0.2408377
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = dta, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2408377
Above are the 10-fold cross-validation code and the resulting error. The data are split into 10 subsets; each subset is used once for testing while the others are used for training, so eventually all of the data are used for both training and testing.
My model has a training error and a testing error of about 0.24, and thus it performs better than the model introduced in DataCamp (which had about 0.26 error).
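The same cross-validation recipe works for any binomial glm. A self-contained sketch on R's built-in mtcars data (the model and the seed are illustrative choices, not part of the analysis above):

```r
library(boot)
# loss function: proportion of misclassified observations
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
set.seed(123)  # make the random fold assignment reproducible
fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
cv  <- cv.glm(data = mtcars, cost = loss_func, glmfit = fit, K = 10)
cv$delta[1]    # estimated test error (a proportion between 0 and 1)
```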
8. Super-Bonus: Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
MODEL1
The first model, with a very high number of predictors:
# fit the model
m <- glm(high_use ~ school+ age+ failures + absences + sex+internet+schoolsup+famsup+paid+activities+higher+romantic+Medu+Fedu+traveltime+studytime+famrel+freetime+goout+health+G1+G2+G3, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ school + age + failures + absences +
## sex + internet + schoolsup + famsup + paid + activities +
## higher + romantic + Medu + Fedu + traveltime + studytime +
## famrel + freetime + goout + health + G1 + G2 + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7676 -0.6984 -0.4146 0.5511 2.6452
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.47222 2.71628 -1.646 0.099671 .
## schoolMS 0.07936 0.49215 0.161 0.871888
## age 0.07185 0.13871 0.518 0.604479
## failures 0.16623 0.21566 0.771 0.440820
## absences 0.06836 0.01808 3.781 0.000156 ***
## sexM 0.94855 0.31300 3.031 0.002441 **
## internetyes -0.13830 0.40076 -0.345 0.730020
## schoolsupyes -0.18560 0.44390 -0.418 0.675869
## famsupyes -0.20794 0.31073 -0.669 0.503368
## paidyes 0.62269 0.29995 2.076 0.037897 *
## activitiesyes -0.50000 0.28663 -1.744 0.081090 .
## higheryes -0.22285 0.69263 -0.322 0.747653
## romanticyes -0.32750 0.31513 -1.039 0.298687
## Medu -0.16788 0.17488 -0.960 0.337051
## Fedu 0.21586 0.17454 1.237 0.216186
## traveltime 0.47690 0.21231 2.246 0.024689 *
## studytime -0.27991 0.19194 -1.458 0.144757
## famrel -0.48392 0.15708 -3.081 0.002065 **
## freetime 0.20776 0.15361 1.352 0.176228
## goout 0.78514 0.14134 5.555 2.78e-08 ***
## health 0.15662 0.10301 1.520 0.128420
## G1 -0.17182 0.09110 -1.886 0.059289 .
## G2 0.11897 0.11396 1.044 0.296508
## G3 0.03662 0.08252 0.444 0.657196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 340.34 on 358 degrees of freedom
## AIC: 388.34
##
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
## (Intercept) schoolMS age failures absences
## -4.47222082 0.07936423 0.07184607 0.16623121 0.06836059
## sexM internetyes schoolsupyes famsupyes paidyes
## 0.94854610 -0.13830143 -0.18559565 -0.20793732 0.62268537
## activitiesyes higheryes romanticyes Medu Fedu
## -0.50000158 -0.22284511 -0.32749905 -0.16788349 0.21586182
## traveltime studytime famrel freetime goout
## 0.47690408 -0.27990842 -0.48391665 0.20775575 0.78514099
## health G1 G2 G3
## 0.15661541 -0.17181500 0.11897195 0.03662077
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 17
## TRUE 54 58
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1858639
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2460733
The training error of MODEL1 (0.186) is clearly lower than that of our first model with four explanatory variables (training error = testing error = 0.241), while the testing error (0.246) is about the same, so we move forward.
MODEL2
In this second model, only the significant (or nearly significant) variables are retained: absences, sex, paid, activities, traveltime, famrel, goout, and G1.
# fit the model
m <- glm(high_use ~ absences + sex+paid+activities+traveltime+famrel+goout+G1, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + paid + activities +
## traveltime + famrel + goout + G1, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0999 -0.7200 -0.4520 0.6456 2.9602
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.96357 0.89179 -3.323 0.00089 ***
## absences 0.06639 0.01678 3.956 7.63e-05 ***
## sexM 1.20658 0.27761 4.346 1.38e-05 ***
## paidyes 0.46782 0.26894 1.739 0.08195 .
## activitiesyes -0.57222 0.27011 -2.118 0.03414 *
## traveltime 0.45079 0.18804 2.397 0.01651 *
## famrel -0.45235 0.14481 -3.124 0.00179 **
## goout 0.80909 0.13091 6.180 6.39e-10 ***
## G1 -0.04001 0.04087 -0.979 0.32757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 356.39 on 373 degrees of freedom
## AIC: 374.39
##
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
## (Intercept) absences sexM paidyes activitiesyes
## -2.96356793 0.06638932 1.20658118 0.46781964 -0.57222077
## traveltime famrel goout G1
## 0.45079294 -0.45234746 0.80908756 -0.04001193
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 256 14
## TRUE 58 54
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1884817
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241
As the testing error of MODEL2 is lower than that of MODEL1, we move forward.
MODEL3
Only the significant variables are kept in the third model: absences, sex, paid, activities, traveltime, famrel, and goout.
# fit the model
m <- glm(high_use ~ absences + sex + paid + activities + traveltime + famrel + goout, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + paid + activities +
## traveltime + famrel + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0675 -0.7223 -0.4531 0.6708 2.8999
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.45743 0.74523 -4.639 3.49e-06 ***
## absences 0.06708 0.01674 4.007 6.16e-05 ***
## sexM 1.18030 0.27583 4.279 1.88e-05 ***
## paidyes 0.44611 0.26714 1.670 0.09494 .
## activitiesyes -0.59722 0.26886 -2.221 0.02633 *
## traveltime 0.47011 0.18649 2.521 0.01171 *
## famrel -0.45315 0.14489 -3.128 0.00176 **
## goout 0.83028 0.12985 6.394 1.61e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 357.35 on 374 degrees of freedom
## AIC: 373.35
##
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
## (Intercept) absences sexM paidyes activitiesyes
## -3.45743452 0.06707767 1.18030243 0.44610789 -0.59722408
## traveltime famrel goout
## 0.47011168 -0.45314583 0.83027754
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 255 15
## TRUE 56 56
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1858639
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2041885
As the training and testing errors of MODEL3 are slightly lower than those of MODEL2, MODEL3 may be the better model.
MODEL4
As all predictors were significant with the exception of paid (p = 0.09), we also tested MODEL4, in which the paid variable was eliminated.
# fit the model
m <- glm(high_use ~ absences + sex + activities + traveltime + famrel + goout, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + activities + traveltime +
## famrel + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9612 -0.7083 -0.4746 0.7186 2.8026
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.16085 0.71564 -4.417 1.00e-05 ***
## absences 0.06634 0.01678 3.954 7.69e-05 ***
## sexM 1.11730 0.27107 4.122 3.76e-05 ***
## activitiesyes -0.58263 0.26744 -2.179 0.02937 *
## traveltime 0.45011 0.18555 2.426 0.01528 *
## famrel -0.45128 0.14424 -3.129 0.00176 **
## goout 0.82124 0.12879 6.377 1.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 360.17 on 375 degrees of freedom
## AIC: 374.17
##
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
## (Intercept) absences sexM activitiesyes traveltime
## -3.16085491 0.06633946 1.11730440 -0.58262984 0.45011152
## famrel goout
## -0.45128170 0.82124209
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 17
## TRUE 58 54
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1963351
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2146597
As the training and testing errors of MODEL4 are higher than those of MODEL3, MODEL3 may be the best of these models for describing high alcohol use. In that model, absences (number of school absences), sex, paid (extra paid classes), activities (extra-curricular activities), traveltime (home-to-school travel time), famrel (quality of family relationships) and goout (going out with friends) describe high alcohol use.
Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
predictors <- c(23, 8, 7, 6)
training_error <- c(0.186, 0.188, 0.186, 0.196)
qplot(x = predictors, y = training_error)
predictors <- c(23, 8, 7, 6)
testing_error <- c(0.230, 0.204, 0.199, 0.217)
qplot(x = predictors, y = testing_error)
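The two error series can also be drawn in a single plot for easier comparison. A base-graphics sketch using the same values as above (the layout choices, such as colors and symbols, are my own):

```r
# Sketch: both error curves in one plot, one line per column of the matrix
predictors <- c(23, 8, 7, 6)
errors <- cbind(training = c(0.186, 0.188, 0.186, 0.196),
                testing  = c(0.230, 0.204, 0.199, 0.217))
# matplot draws each column of 'errors' against 'predictors'
matplot(predictors, errors, type = "b", pch = 1:2, col = c("blue", "red"),
        xlab = "number of predictors", ylab = "error")
legend("topright", legend = colnames(errors), pch = 1:2, col = c("blue", "red"))
```

With these values the testing error is lowest for the 7-predictor model (MODEL3).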
2. Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Describe the dataset briefly
crim=per capita crime rate by town.
zn=proportion of residential land zoned for lots over 25,000 sq.ft.
indus=proportion of non-retail business acres per town.
chas=Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox=nitrogen oxides concentration (parts per 10 million).
rm=average number of rooms per dwelling.
age=proportion of owner-occupied units built prior to 1940.
dis= weighted mean of distances to five Boston employment centres.
rad=index of accessibility to radial highways.
tax=full-value property-tax rate per $10,000.
ptratio=pupil-teacher ratio by town.
black=1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat=lower status of the population (percent).
medv=median value of owner-occupied homes in $1000s.
3. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# a more advanced plot matrix with ggpairs()
p <- ggpairs(Boston, mapping = aes(),lower = list(combo = wrap("facethist", bins = 30)))
# draw the plot
p
Figure 1. Distributions.
# MASS, corrplot, tidyverse and Boston dataset are available
library(corrplot)
library(dplyr)
library(MASS)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Distributions
Figure 1 shows the distributions of the variables.
Only rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes in $1000s) were close to normally distributed.
The variables crim (per capita crime rate by town), zn (proportion of residential land zoned), chas (Charles River dummy variable), nox (nitrogen oxides concentration), dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways) and lstat (lower status of the population) were skewed to the right, with most of their mass at small values; crim, zn and chas in particular take mostly very small values.
The variables age (proportion of owner-occupied units built prior to 1940), ptratio (pupil-teacher ratio by town), and black (proportion of blacks by town) were skewed to the left, with most of their mass at large values.
The variable indus (proportion of non-retail business acres per town) has a bimodal distribution with two peaks.
Relationship
Next, we describe the relationships between the variables using the correlation table and corrplot above. In general, the correlations between variables were quite high; only the chas variable showed low correlations with the others. Below we comment only on some of the strongest correlations; detailed descriptions of the variables are given above, so only the abbreviations are used here.
The variable crim has high positive correlations with rad and tax.
The variable zn has a high positive correlation with dis and negative correlations with indus, nox, and age.
The variable indus also has high positive correlations with several variables, such as nox and tax, and a high negative correlation with dis.
The variable chas has no notable correlations with any of the other variables.
The variable nox also has high positive correlations with several variables, such as age, and a high negative correlation with dis.
The variable rm has a high negative correlation with lstat and a high positive correlation with medv.
The variable age also has a high negative correlation with dis and a high positive correlation with lstat.
The variable rad also has a very high positive correlation with tax.
The variable tax also has a fairly high positive correlation with lstat.
The variable ptratio also has a fairly high negative correlation with medv.
The variable lstat also has a high negative correlation with medv.
4. Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset. Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
Standardize the dataset and print out summaries of the scaled data.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
How did the variables change?
In the scaling, we subtract the column mean from each column and divide the difference by the column's standard deviation, so after scaling every variable has mean zero and standard deviation one. Bringing the variables onto a common scale in this way works toward the assumptions of linear discriminant analysis (LDA): that the variables are normally distributed (assumption 1) and that they share the same variance (assumption 2).
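The equivalence between scale() and the manual computation can be checked on a toy vector (my own illustrative values):

```r
# Minimal sketch: scale() subtracts the column mean and divides by the
# column standard deviation, so the result has mean 0 and sd 1
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto <- as.numeric(scale(x))  # scale() treats a vector as a 1-column matrix
all.equal(manual, auto)
```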
Create a categorical variable of the crime rate in the Boston dataset. Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
# summary of the scaled_crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE)
# print the new factor crime
crime
## [1] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [5] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [9] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [13] (-0.411,-0.39] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [17] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [21] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [25] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [29] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [33] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] [-0.419,-0.411]
## [37] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39] [-0.419,-0.411]
## [41] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [45] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [49] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411]
## [53] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [57] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [61] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [65] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [69] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [73] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39]
## [77] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39]
## [81] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [85] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [89] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [93] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]
## [97] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411]
## [101] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [105] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [109] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39] (-0.411,-0.39]
## [113] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [117] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [121] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [125] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.39,0.00739]
## [129] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [133] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [137] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739]
## [141] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (0.00739,9.92]
## [145] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [149] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [153] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [157] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [161] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [165] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [169] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [173] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411]
## [177] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [181] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [185] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [189] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411]
## [193] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [197] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [201] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [205] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [209] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739]
## [213] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39]
## [217] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [221] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [225] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [229] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [233] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [237] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.411,-0.39]
## [241] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [245] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39]
## [249] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [253] (-0.411,-0.39] (-0.39,0.00739] [-0.419,-0.411] [-0.419,-0.411]
## [257] [-0.419,-0.411] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [261] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [265] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [269] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39]
## [273] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39]
## [277] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]
## [281] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [285] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [289] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [293] [-0.419,-0.411] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39]
## [297] [-0.419,-0.411] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411]
## [301] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [305] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [309] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [313] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]
## [317] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739] (-0.39,0.00739]
## [321] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.39,0.00739]
## [325] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39]
## [329] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [333] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [337] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [341] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [345] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [349] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [353] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]
## [357] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [361] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [365] (-0.39,0.00739] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [369] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [373] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [377] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [381] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [385] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [389] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [393] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [397] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [401] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [405] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [409] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [413] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [417] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [421] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [425] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [429] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [433] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [437] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [441] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [445] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [449] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [453] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [457] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [461] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [465] (0.00739,9.92] (-0.39,0.00739] (0.00739,9.92] (0.00739,9.92]
## [469] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [473] (-0.39,0.00739] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [477] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [481] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (-0.39,0.00739]
## [485] (-0.39,0.00739] (-0.39,0.00739] (0.00739,9.92] (0.00739,9.92]
## [489] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [493] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.411,-0.39]
## [497] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.411,-0.39]
## [501] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [505] (-0.411,-0.39] [-0.419,-0.411]
## 4 Levels: [-0.419,-0.411] (-0.411,-0.39] ... (0.00739,9.92]
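Printing the factor lists every single observation; table() gives a much more compact view of the class counts. A toy sketch of cut() with quantile breaks, using hypothetical values:

```r
# Toy sketch (hypothetical values): cut a vector at its quartiles and
# count the observations per class with table()
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
b <- quantile(v)  # 0%, 25%, 50%, 75%, 100% break points
f <- cut(v, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)  # two observations fall into each quartile class
```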
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
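One note on reproducibility (an addition, not part of the original code): fixing the random seed before calling sample() makes the 80/20 split, and therefore the results below, repeatable across knits. A minimal sketch:

```r
# Sketch: a fixed seed makes sample() return the same indices every run
set.seed(123)  # any fixed seed works
n <- 10
ind <- sample(n, size = n * 0.8)  # indices of the training rows
length(ind)  # 8 of the 10 rows go to the training set
```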
5. Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
# linear discriminant analysis, crime rate as the target variable and all other as predictors
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2648515 0.2425743 0.2524752 0.2400990
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 0.9822034 -0.9452053 -0.125147750 -0.9016656 0.46179974
## (-0.411,-0.39] -0.1109714 -0.2085992 0.008892378 -0.5283735 -0.17157673
## (-0.39,0.00739] -0.3930786 0.1348878 0.268057239 0.3714282 0.09021303
## (0.00739,9.92] -0.4872402 1.0149946 -0.069385757 1.0444341 -0.45801565
## age dis rad tax ptratio
## [-0.419,-0.411] -0.8797402 0.9089842 -0.6920711 -0.7510687 -0.41762332
## (-0.411,-0.39] -0.3131095 0.3319144 -0.5564696 -0.4685724 -0.05063187
## (-0.39,0.00739] 0.4085632 -0.3504009 -0.4256529 -0.3409495 -0.25705681
## (0.00739,9.92] 0.8036562 -0.8736526 1.6596029 1.5294129 0.80577843
## black lstat medv
## [-0.419,-0.411] 0.37877049 -0.78607880 0.51925093
## (-0.411,-0.39] 0.31668689 -0.06398780 -0.03463271
## (-0.39,0.00739] 0.06848728 0.04604351 0.17978139
## (0.00739,9.92] -0.75575451 0.90523789 -0.70728473
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09199132 0.71513647 -0.818745352
## indus 0.09003160 -0.18905273 0.629688407
## chas -0.06991944 -0.12674189 0.076560330
## nox 0.15144264 -0.81522606 -1.407367601
## rm -0.12341972 -0.08162290 -0.200208619
## age 0.24036697 -0.28269812 -0.237220463
## dis -0.11279830 -0.33930493 0.197408887
## rad 3.67460209 0.92116041 0.002254651
## tax 0.01293590 0.02667671 0.408663565
## ptratio 0.11169457 -0.03761211 -0.311034958
## black -0.14049026 0.02069159 0.148311804
## lstat 0.15424961 -0.39626524 0.314750754
## medv 0.13531526 -0.46166547 -0.225911036
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9533 0.0352 0.0115
In the results above you can see, for example, that the first linear discriminant function (LD1) explains 95% of the between-group (between-class) variance, so it is the most important function for distinguishing between the classes. Furthermore, judging by the coefficients, the variable rad appears to be the best linear separator of the crime classes (groups).
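For reference, the "proportion of trace" is just the squared singular values of the discriminant functions normalized to sum to one (MASS stores the singular values in lda.fit$svd). A sketch with hypothetical singular values:

```r
# Sketch (hypothetical singular values): proportion of trace is the
# squared singular values normalized to sum to one
sv <- c(9.0, 1.7, 1.0)
prop <- sv^2 / sum(sv^2)
round(prop, 4)  # LD1 dominates, as in the output above
```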
# function for drawing the arrows of the LDA (bi)plot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
In the LDA biplot, you can see how each class is distinguished from the others by color. Judging by the arrows, which are drawn from the coefficients, only rad stands out as a powerful linear separator.
6. Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739]
## [-0.419,-0.411] 12 7 1
## (-0.411,-0.39] 6 16 6
## (-0.39,0.00739] 0 3 19
## (0.00739,9.92] 0 0 1
## predicted
## correct (0.00739,9.92]
## [-0.419,-0.411] 0
## (-0.411,-0.39] 0
## (-0.39,0.00739] 2
## (0.00739,9.92] 29
Comment on the results.
As the table above shows, the model classifies most of the test observations correctly: 12 of 20 (60%) in the first group (the interval [-0.419, -0.411]), 16 of 28 (57%) in the second group ((-0.411, -0.39]), 19 of 24 (79%) in the third group ((-0.39, 0.00739]), and 29 of 30 (97%) in the fourth group ((0.00739, 9.92]). Overall, 76 of the 102 test observations (about 75%) are classified correctly, and the highest crime rate class is predicted most accurately.
Note that the exact counts and percentages change when the document is re-knitted, because the train/test split is random.
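The correct-classification rates can also be computed directly from the cross tabulation. A minimal sketch, assuming correct_classes and lda.pred are the objects created above:

```r
# rebuild the confusion matrix and derive accuracies from it
conf <- table(correct = correct_classes, predicted = lda.pred$class)
# per-class accuracy: diagonal counts divided by row totals
round(diag(conf) / rowSums(conf), 2)
# overall accuracy: share of all observations on the diagonal
sum(diag(conf)) / sum(conf)
```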
7. Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
Reload the Boston dataset and standardize the dataset.
# access the MASS package
library(MASS)
# load the data
data("Boston")
# center and standardize variables
boston_stand <- scale(Boston)
Calculate the distances between the observations.
# euclidean distance matrix
dist_eu <- dist(boston_stand)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(boston_stand, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
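As a quick sanity check, an entry of the Euclidean distance matrix can be reproduced by hand from the standardized data. A minimal sketch, assuming boston_stand and dist_eu exist as above:

```r
# distance between the first two standardized observations, by the formula
d_manual <- sqrt(sum((boston_stand[1, ] - boston_stand[2, ])^2))
# the same entry read from the distance matrix
d_matrix <- as.matrix(dist_eu)[1, 2]
all.equal(d_manual, d_matrix)  # the two values should agree
```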
Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
# 1. centers = 2
# Euclidean distance matrix
dist_eu <- dist(boston_stand)
# k-means clustering with 2 centers
set.seed(123)
km <- kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)
# 2. centers = 3
# k-means clustering with 3 centers
set.seed(123)
km <- kmeans(dist_eu, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)
# 3. centers = 4
# k-means clustering with 4 centers
set.seed(123)
km <- kmeans(dist_eu, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)
Interpret the results.
The pairs figures, in which the clusters are separated by color, suggest that the optimal number of clusters might be three: when the number of centers was increased from three to four, the data points hardly changed cluster, whereas increasing the centers from two to three still reassigned data points to the new cluster (2 -> 3). Hence, we suggest that the optimal number of clusters is three.
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?
Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset.
Based on the analysis above, we choose four clusters (centers = 4), one more than the optimum found earlier.
# access the MASS package
library(MASS)
# load the data
data("Boston")
# center and standardize variables
boston_stand <- scale(Boston)
set.seed(123)
# euclidean distance matrix
dist_eu <- dist(boston_stand)
# determine the number of clusters
k_max <- 4
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
# k-means clustering
km <- kmeans(dist_eu, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)
Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model.
# linear discriminant analysis, clusters as target classes and all other as predictors
lda2.fit <- lda(km$cluster ~ ., data = Boston)
# print the lda.fit object
lda2.fit
## Call:
## lda(km$cluster ~ ., data = Boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2272727 0.1304348 0.2114625 0.4308300
##
## Group means:
## crim zn indus chas nox rm age
## 1 6.0201918 0.000000 19.295565 0.0000000 0.6589652 6.090009 90.29652
## 2 15.9401909 0.000000 18.470303 0.1818182 0.7104242 5.760379 92.71364
## 3 0.2484426 40.915888 5.130748 0.2149533 0.4694196 6.982925 49.01308
## 4 0.2636922 6.293578 7.560505 0.0000000 0.4943982 6.203284 59.40963
## dis rad tax ptratio black lstat medv
## 1 2.168761 17.391304 582.0261 19.75652 355.1060 17.020696 17.50783
## 2 1.769233 20.818182 626.8333 19.36515 205.5244 21.173030 15.00000
## 3 5.427185 4.355140 300.8879 16.36449 387.8884 6.797196 32.82617
## 4 4.465165 4.550459 303.0688 18.52018 387.9413 10.643807 22.41193
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.021057935 -0.058271581 0.0703771294
## zn -0.018564727 -0.044961797 -0.0289018272
## indus -0.200796084 0.043976280 -0.1560182615
## chas 0.169607803 -2.991498755 0.8838096601
## nox -9.033201420 -3.331966520 2.8710406785
## rm 0.212247438 -0.214962952 -0.9669930466
## age 0.003516101 0.001858370 -0.0093380639
## dis -0.062398001 -0.075668875 0.0165639237
## rad -0.075596848 0.059603084 -0.0552930235
## tax -0.001714967 -0.003425927 -0.0006141385
## ptratio -0.102713176 0.077073413 0.0424108370
## black 0.004680516 0.006401206 -0.0098438564
## lstat -0.034057442 -0.086790743 0.0015673331
## medv -0.023878793 -0.103139302 0.0185551449
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution)
# function that adds arrows for the variables to an LDA (bi)plot
lda2.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda2.fit, dimen = 2, col = classes, pch = classes)
lda2.arrows(lda2.fit, myscale = 1)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
Interpret the results. Which variables are the most influential linear separators for the clusters?
In the results above, you can see that the first linear discriminant function (LD1) explains about 76% of the between-cluster variance, so it is the most important function for distinguishing between the clusters. Note also that the second linear discriminant (LD2) explains about 18% of the between-cluster variance, while LD3 explains only about 6%. This supports the earlier suggestion that three clusters would be quite enough.
Furthermore, judging by the coefficients, the variable nox is the most influential linear separator for the clusters on both LD1 and LD2, and chas is the second most influential on LD2.
In the LDA biplot, you can see how each cluster is distinguished from the others by color. As mentioned, only nox and chas stand out as influential linear separators, which is reflected in the lengths of their arrows.
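The claim about nox and chas can be checked by ranking the discriminant coefficients by absolute size. A minimal sketch, assuming lda2.fit is the model fitted above:

```r
# variables ordered by the absolute size of their LD1 coefficients
sort(abs(coef(lda2.fit)[, "LD1"]), decreasing = TRUE)
# the same for LD2
sort(abs(coef(lda2.fit)[, "LD2"]), decreasing = TRUE)
```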
Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set.
crimeclass = train$crime
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = crimeclass)
Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
library(plotly)
# note: km$cluster has one value per row of the full Boston data (506 rows),
# while matrix_product has only the 404 training rows, so the colors do not align
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = km$cluster)
The figures were otherwise very similar.
The 3D plot where the color is defined by crimeclass showed clearly separated groups, each distinguished by its color. As mentioned above, the group (0.00739, 9.92] is the easiest to differentiate.
In turn, the 3D plot where the color should be defined by the k-means clusters did not show the colors, and the groups are hard to identify: the cluster vector was computed on the full Boston data (506 observations), so it does not match the 404 training observations one-to-one.
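One way to obtain a meaningful k-means coloring would be to restrict the cluster vector to the training rows before plotting. A minimal sketch, assuming a vector ind holding the training row indices was saved when the train/test split was made (the name ind is a hypothetical placeholder; it does not appear in the code above):

```r
# hypothetical: ind = the row indices that went into the training set
# keep only the cluster labels of the training observations
train_clusters <- km$cluster[ind]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = factor(train_clusters))
```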